Correspondence Analysis on Generalised Aggregated Lexical Tables (CA-GALT) in the FactoMineR Package

Author

  • François Husson
Abstract

Correspondence analysis on generalised aggregated lexical tables (CA-GALT) is a method that generalizes classical CA-ALT to the case of several quantitative, categorical and mixed variables. It aims to establish a typology of the external variables and a typology of the events from their mutual relationships. To do so, the influence of the external variables on the lexical choices is untangled by cancelling the associations among them, and, to avoid the instability arising from multicollinearity, the variables are replaced by their principal components. The CaGalt function, implemented in the FactoMineR package, provides numerous numerical and graphical outputs. Confidence ellipses are also provided to validate and improve the representation of words and variables. Although this methodology was developed mainly to address the problem of analyzing open-ended questions, it can be applied to any kind of frequency/contingency table with external variables.

Introduction

Frequency tables are a common data structure in very different domains such as ecology (species abundance table), textual analysis (documents × words table) and public information systems (administrative registers such as mortality data). This type of table counts the occurrences of a series of events (species, words, death causes) observed on different units (ecological sites, documents, administrative areas). Correspondence analysis (CA) is a reference method for analysing this type of table. It offers a visualization of the similarities between events, the similarities between units and the associations between events and units (Benzécri, 1973; Lebart et al., 1998; Murtagh, 2005; Greenacre, 2007; Beh and Lombardo, 2014). However, this method presents two main drawbacks when the frequency table is very sparse:

1. The first axes frequently show the relationships between small sets of units and small sets of events and do not reveal global trends.
2. The similarities/oppositions among units cannot be interpreted without taking into account the unit characteristics (for example, climatic conditions, socioeconomic description of the respondents or economic characteristics of the area).

To overcome these drawbacks, contextual variables are also observed on the units and introduced into the analysis. A first step consists of grouping the units according to one categorical variable and building an aggregated frequency table (AFT) crossing the categories (rows) and the events (columns). In this AFT, the former row-units corresponding to the same category are collapsed into a single row while the event-columns remain unchanged. Then, CA is applied to this AFT, which in textual analysis is often called an aggregated lexical table (ALT; Lebart et al., 1998). CA on the aggregated lexical table (CA-ALT) usually leads to robust and interpretable results. CA-ALT visualizes the similarities among categories, the similarities among words and the associations between categories and words. The same approach can be applied in other domains.

The main drawback of CA-ALT is its restrictiveness: only one categorical variable can be considered, while several categorical and quantitative contextual variables are often available and associated with the events. Recently, correspondence analysis on generalised aggregated lexical tables (CA-GALT; Bécue-Bertaut and Pagès, 2015; Bécue-Bertaut et al., 2014) has been proposed to generalize CA-ALT to the case of several quantitative, categorical and mixed variables. CA-GALT brings out the relationships between the vocabulary and the selected contextual variables.

This article presents an R function implementing CA-GALT in the FactoMineR package (Lê et al., 2008; Husson et al., 2010) and has the following outline: We first describe the example used to illustrate the method and introduce the notation. Then we recall the principles of the CA-GALT methodology and proceed to detail the function and the algorithm. Subsequently, the results obtained on the example are provided. Finally, we conclude with some remarks.

(The R Journal Vol. 7/1, June 2015, ISSN 2073-4859; Contributed Research Articles, page 110)

Example

The example is extracted from a survey aimed at better understanding how non-experts define health. An open-ended question, “What does health mean to you?”, was asked to 392 respondents, who answered through free-text comments. The documents × words table is built keeping only the words used at least 10 times across all respondents. This minimum threshold is used to obtain statistically interpretable results (Lebart et al., 1998; Murtagh, 2005). Thus, 115 different words and 7751 occurrences are kept.

The respondents’ characteristics are also collected. In this example, we use age in groups (under 21, 21–35, 36–50 and over 50), gender (man and woman) and health condition (poor, fair, good and very good health), as they possibly condition the respondents’ viewpoint. CA-GALT is able to determine the main dispersion dimensions insofar as they are related to the respondents’ characteristics.

Notation

The data is coded into two matrices (see Figure 1). The (I × J) matrix Y, with generic term yij, contains the frequency of the J words in the I respondents’ answers. The (I × K) matrix X, with generic term xik, stores the K respondents’ characteristics, coded as dummy variables from the L categorical variables.
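To make the notation concrete, the following is a minimal Python sketch of how Y and X could be assembled, and of how units are collapsed by category into an aggregated table. It is an illustration only and is not the CaGalt function or the FactoMineR API. The toy answers, the variable values, the helper names (build_frequency_table, dummy_code, aggregate) and the lowered threshold MIN_COUNT = 2 are all assumptions made for the example. The step in which CA-GALT cancels the associations among the external variables is not shown.

```python
from collections import Counter

# Toy stand-ins for free-text answers (hypothetical, not the survey data).
answers = [
    "good health is feeling well",
    "health is not being ill",
    "feeling well and being active",
    "not being ill",
]
# Categorical characteristics of the same four respondents (hypothetical values).
gender = ["woman", "man", "woman", "man"]
age = ["under 21", "21-35", "21-35", "over 50"]

MIN_COUNT = 2  # the article keeps words used at least 10 times; lowered for the toy corpus


def build_frequency_table(texts, min_count):
    """Return (words, Y), where Y[i][j] counts word j in text i."""
    tokens = [t.split() for t in texts]
    totals = Counter(w for doc in tokens for w in doc)
    words = sorted(w for w, n in totals.items() if n >= min_count)
    Y = [[doc.count(w) for w in words] for doc in tokens]
    return words, Y


def dummy_code(*variables):
    """Return (labels, X): one 0/1 column per category of each variable."""
    labels, columns = [], []
    for var in variables:
        for cat in sorted(set(var)):
            labels.append(cat)
            columns.append([1 if v == cat else 0 for v in var])
    X = [list(row) for row in zip(*columns)]
    return labels, X


def aggregate(X, Y):
    """Return the K x J product X^T Y: word counts summed within each category."""
    n_units, K, J = len(X), len(X[0]), len(Y[0])
    return [[sum(X[i][k] * Y[i][j] for i in range(n_units)) for j in range(J)]
            for k in range(K)]


words, Y = build_frequency_table(answers, MIN_COUNT)   # I x J frequency table
labels, X = dummy_code(gender, age)                    # I x K indicator matrix
alt = aggregate(X, Y)                                  # one row per category

for label, row in zip(labels, alt):
    print(label, dict(zip(words, row)))
```

Each row of X contains exactly L ones (one per categorical variable, here L = 2), and each row of the aggregated table is the sum of the rows of Y belonging to that category, which is the collapsing of row-units described for the AFT.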

Journal: The R Journal

Volume 7, Issue 1

Publication date: 2015